
Camera interfaces

Started by Don Y December 29, 2022
> [Hope you are faring well... enjoying the COLD! ;) ]
Not very. Don't think I have your latest email.

On Fri, 30 Dec 2022 14:59:39 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
>On 12/30/2022 11:02 AM, Richard Damon wrote: >>>>> So, my options are: >>>>> - reduce the overall frame rate such that N cameras can >>>>> &nbsp;&nbsp; be serviced by the USB (or whatever) interface *and* >>>>> &nbsp;&nbsp; the processing load >>>>> - reduce the resolution of the cameras (a special case of the above) >>>>> - reduce the number of cameras "per processor" (again, above) >>>>> - design a "camera memory" (frame grabber) that I can install >>>>> &nbsp;&nbsp; multiply on a single host >>>>> - develop distributed algorithms to allow more bandwidth to >>>>> &nbsp;&nbsp; effectively be applied >>>> >>>> The fact that you are starting for the concept of using "USB Cameras" sort >>>> of starts you with that sort of limit. >>>> >>>> My personal thought on your problem is you want to put a "cheap" processor >>>> right on each camera using a processor with a direct camera interface to >>>> pull in the image and do your processing and send the results over some >>>> comm-link to the center core. >>> >>> If I went the frame-grabber approach, that would be how I would address the >>> hardware.&nbsp; But, it doesn't scale well.&nbsp; I.e., at what point do you throw in >>> the towel and say there are too many concurrent images in the scene to >>> pile them all onto a single "host" processor? >> >> Thats why I didn't suggest that method. I was suggesting each camera has its >> own tightly coupled processor that handles the need of THAT > >My existing "module" handles a single USB camera (with a fairly heavy-weight >processor). > >But, being USB-based, there is no way to look at *part* of an image. >And, I have to pay a relatively high cost (capturing the entire >image from the serial stream) to look at *any* part of it. > >*If* a "camera memory" was available, I would site N of these >in the (64b) address space of the host and let the host pick >and choose which parts of which images it wanted to examine... >without worrying about all of the bandwidth that would have been >consumed deserializing those N images into that memory (which is >a continuous process)
That's the way all cameras work - at least low level. The camera captures a field (or a frame, depending) on its CCD, and then the CCD pixel data is read out serially by a controller. What you are looking for is some kind of local frame buffering at the camera. There are some "smart" cameras that provide that ... and also generally a bunch of image analysis functions that you may or may not find useful. I haven't played with any of them in a long time, and when I did the image functions were too primitive for my purpose, so I really can't recommend anything.
>>> ISTM that the better solution is to develop algorithms that can >>> process portions of the scene, concurrently, on different "hosts". >>> Then, coordinate these "partial results" to form the desired result. >>> >>> I already have a "camera module" (host+USB camera) that has adequate >>> processing power to handle a "single camera scene".&nbsp; But, these all >>> assume the scene can be easily defined to fit in that camera's field >>> of view.&nbsp; E.g., point a camera across the path of a garage door and have >>> it "notice" any deviation from the "unobstructed" image. >> >> And if one camera can't fit the full scene, you use two cameras, each with >> there own processor, and they each process their own image. > >That's the above approach, but... > >> The only problem is if your image processing algoritm need to compare parts of >> the images between the two cameras, which seems unlikely. > >Consider watching a single room (e.g., a lobby at a business) and >tracking the movements of "visitors". It's unlikely that an individual's >movements would always be constrained to a single camera field. There will >be times when he/she is "half-in" a field (and possibly NOT in the other, >HALF in the other or ENTIRELY in the other). You can't ignore cases where >the entire object (or, your notion of what that object's characteristics >might be) is not entirely in the field as that leaves a vulnerability.
I've done simple cases following objects from one camera to another, but not dealing with different angles/points of view - the cameras had contiguous views with a bit of overlap. That made it relatively easy. Following a person seen, e.g., quarter-behind in one camera, and picking them up again in another camera that sees a side view - from the /other/ side - is another matter. Just following a person is easy, but tracking a specific person, particularly when multiple people are present, gets very complicated very quickly.
>For example, I watch our garage door with *four* cameras. A camera is >positioned on each side ("door jam"?) of the door "looking at" the other >camera. This because a camera can't likely see the full height of the door >opening ON ITS SIDE OF THE DOOR (so, the opposing camera watches "my side" >and I'll watch *its* side!). > >[The other two cameras are similarly positioned on the overhead *track* >onto which the door rolls, when open] > >An object in (or near) the doorway can be visible in one (either) or >both cameras, depending on where it is located. Additionally, one of >those manifestations may be only "partial" as regards to where it is >located and intersects the cameras' fields of view. > >The "cost" of watching the door is only the cost of the actual *cameras*. >The cost of the compute resources is amortized over the rest of the system >as those can be used for other, non-camera, non-garage related activities. > >> It does say that if trying to track something across the cameras, you need >> enough overlap to allow them to hand off the object when it is in the overlap. > >And, objects that consume large portions of a camera's field of view >require similar handling (unless you can always guarantee that cameras >and targets are "far apart") > >>> When the scene gets too large to represent in enough detail in a single >>> camera's field of view, then there needs to be a way to coordinate >>> multiple cameras to a single (virtual?) host.&nbsp; If those cameras were just >>> "chunks of memory", then the *imagery* would be easy to examine in a single >>> host -- though the processing power *might* need to increase geometrically >>> (depending on your current goal) >> >> Yes, but your "chunks of memory" model just doesn't exist as a viable camera >> model. > >Apparently not -- in the COTS sense. But, that doesn't mean I can't >build a "camera memory emulator". > >The downside is that this increases the cost of the "actual camera" >(see my above comment wrt ammortization). > >And, it just moves the point at which a single host (of fixed capabilities) >can no longer handle the scene's complexity. (when you have 10 cameras?) > >> The CMOS cameras with addressable pixels have "access times" significantly >> lower than your typical memory (and is read once) so doesn't really meet that >> model. Some of them do allow for sending multiple small regions of intererst >> and down loading just those regions, but this then starts to require moderate >> processor overhead to be loading all these regions and updating the grabber to >> put them where you want. > >You would, instead, let the "camera memory emulator" capture the entire >image from the camera and place the entire image in a contiguous >region of memory (from the perspective of the host). The cost of capturing >the portions that are not used is hidden *in* the cost of the "emulator". > >> And yes, it does mean that there might be some cases where you need a core >> module that has TWO cameras connected to a single processor, either to get a >> wider field of view, or to combine two different types of camera (maybe a high >> res black and white to a low res color if you need just minor color >> information, or combine a visible camera to a thermal camera). These just >> become another tool in your tool box. > >I *think* (uncharted territory) that the better investment is to develop >algorithms that let me distribute the processing among multiple >(single) "camera modules/nodes". 
How would your "two camera" exemplar >address an application requiring *three* cameras? etc. > >I can, currently, distribute this processing by treating the >region of memory into which a (local) camera's imagery is >deserialized as a "memory object" and then exporting *access* >to that object to other similar "camera modules/nodes". > >But, the access times of non-local memory are horrendous, given >that the contents are ephemeral (if accesses could be *cached* >on each host needing them, then these costs diminish). > >So, I need to come up with algorithms that let me export abstractions >instead of raw data. > >>> Moving the processing to "host per camera" implementation gives you more >>> MIPS.&nbsp; But, makes coordinating partial results tedious. >> >> Depends on what sort of partial results you are looking at. > >"Bob's *head* is at X,Y+H,W in my image -- but, his body is not visible" > >"Ah! I was wondering whose legs those were in *my* image!" > >>>> It is unclear what you actual image requirements per camera are, so it is >>>> hard to say what level camera and processor you will need. >>>> >>>> My first feeling is you seem to be assuming a fairly cheep camera and then >>>> doing some fairly simple processing over the partial image, in which case >>>> you might even be able to live with a camera that uses a crude SPI interface >>>> to bring the frame in, and a very simple processor. >>> >>> I use A LOT of cameras.&nbsp; But, I should be able to swap the camera >>> (upgrade/downgrade) and still rely on the same *local* compute engine. >>> E.g., some of my cameras have Ir illuminators; it's not important >>> in others; some are PTZ; others fixed. >> >> Doesn't sound reasonable. If you downgrade a camera, you can't count on it >> being able to meet the same requirements, or you over speced the initial camera. > >Sorry, I was using up/down relative to "nominal camera", not "specific camera >previously selected for application". I'd 8really* like to just have a >single "camera module" (module = CPU+I/O) instead of one for camera type A >and another for camera type B, etc. > >> You put on a camera a processor capable of handling the tasks you expect out of >> that set of hardware.&nbsp; One type of processor likely can handle a variaty of >> different camera setup with > >Exactly. If a particular instance has an Ir illuminator, then you include >controls for that in *the* "camera module". If another instance doesn't have >this ability, then those controls go unused. > >>> Watching for an obstruction in the path of a garage door (open/close) >>> has different requirements than trying to recognize a visitor at the front >>> door.&nbsp; Or, identify the locations of the occupants of a facility. >> >> Yes, so you don't want to "Pay" for the capability to recognize a visitor in >> your garage door sensor, so you use different levels of sensor/processor. > >Exactly. But, the algorithms that do the scene analysis can be the same; >you just parameterize the image and the objects within it that you seek. > >There will likely be some combinations that exceed the capabilities of >the hardware to process in real-time. So, you fall back to lower >frame rates or let the algorithms drop targets ("You watch Bob, I'll >watch Tom!") >
On 12/31/2022 3:40 PM, George Neuner wrote:
>> [Hope you are faring well... enjoying the COLD! ;) ]
>
> Not very. Don't think I have your latest email.
Hmmm... I wondered why I hadn't heard from you! (I trashed a bunch of email aliases trying to shake off spammers -- you know, the businesses that "need" your email address in order for you to place an order... and then feel like you will be DELIGHTED to receive an ongoing stream of solicitations! The problem with aliases is that you can't "undelete" them -- they get permanently excised from the mail domain's name space, for obvious reasons!)
> On Fri, 30 Dec 2022 14:59:39 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >> On 12/30/2022 11:02 AM, Richard Damon wrote: >>>>>> So, my options are: >>>>>> - reduce the overall frame rate such that N cameras can >>>>>> &nbsp;&nbsp; be serviced by the USB (or whatever) interface *and* >>>>>> &nbsp;&nbsp; the processing load >>>>>> - reduce the resolution of the cameras (a special case of the above) >>>>>> - reduce the number of cameras "per processor" (again, above) >>>>>> - design a "camera memory" (frame grabber) that I can install >>>>>> &nbsp;&nbsp; multiply on a single host >>>>>> - develop distributed algorithms to allow more bandwidth to >>>>>> &nbsp;&nbsp; effectively be applied >>>>> >>>>> The fact that you are starting for the concept of using "USB Cameras" sort >>>>> of starts you with that sort of limit. >>>>> >>>>> My personal thought on your problem is you want to put a "cheap" processor >>>>> right on each camera using a processor with a direct camera interface to >>>>> pull in the image and do your processing and send the results over some >>>>> comm-link to the center core. >>>> >>>> If I went the frame-grabber approach, that would be how I would address the >>>> hardware.&nbsp; But, it doesn't scale well.&nbsp; I.e., at what point do you throw in >>>> the towel and say there are too many concurrent images in the scene to >>>> pile them all onto a single "host" processor? >>> >>> Thats why I didn't suggest that method. I was suggesting each camera has its >>> own tightly coupled processor that handles the need of THAT >> >> My existing "module" handles a single USB camera (with a fairly heavy-weight >> processor). >> >> But, being USB-based, there is no way to look at *part* of an image. >> And, I have to pay a relatively high cost (capturing the entire >> image from the serial stream) to look at *any* part of it. >> >> *If* a "camera memory" was available, I would site N of these >> in the (64b) address space of the host and let the host pick >> and choose which parts of which images it wanted to examine... >> without worrying about all of the bandwidth that would have been >> consumed deserializing those N images into that memory (which is >> a continuous process) > > That's the way all cameras work - at least low level. The camera > captures a field (or a frame, depending) on its CCD, and then the CCD > pixel data is read out serially by a controller. > > What you are looking for is some kind of local frame buffering at the
Exactly. And, bring that buffer out to a set of pins for random access -- like a DRAM (memory). In that way, I could explore whatever parts of the image I deemed necessary -- without paying a price (bandwidth) to pull the image data into "my" memory.
> camera. There are some "smart" cameras that provide that ... and also > generally a bunch of image analysis functions that you may or may not > find useful. I haven't played with any of them in a long time, and > when I did the image functions were too primitive for my purpose, so I > really can't recommend anything. > >>>> ISTM that the better solution is to develop algorithms that can >>>> process portions of the scene, concurrently, on different "hosts". >>>> Then, coordinate these "partial results" to form the desired result. >>>> >>>> I already have a "camera module" (host+USB camera) that has adequate >>>> processing power to handle a "single camera scene".&nbsp; But, these all >>>> assume the scene can be easily defined to fit in that camera's field >>>> of view.&nbsp; E.g., point a camera across the path of a garage door and have >>>> it "notice" any deviation from the "unobstructed" image. >>> >>> And if one camera can't fit the full scene, you use two cameras, each with >>> there own processor, and they each process their own image. >> >> That's the above approach, but... >> >>> The only problem is if your image processing algoritm need to compare parts of >>> the images between the two cameras, which seems unlikely. >> >> Consider watching a single room (e.g., a lobby at a business) and >> tracking the movements of "visitors". It's unlikely that an individual's >> movements would always be constrained to a single camera field. There will >> be times when he/she is "half-in" a field (and possibly NOT in the other, >> HALF in the other or ENTIRELY in the other). You can't ignore cases where >> the entire object (or, your notion of what that object's characteristics >> might be) is not entirely in the field as that leaves a vulnerability. > > I've done simple cases following objects from one camera to another, > but not dealing with different angles/points of view - the cameras had > contiguous views with a bit of overlap. That made it relatively easy.
Yes. Each camera needs to grok the physical space in order to understand "references" provided by another observer into that space.

For the garage door cameras, it's relatively simple: you're looking at a very narrow strip of 2-space (the plane of the door) from opposing ends. You *know* that the door opening has the same physical dimensions as seen from each door jamb by the opposing cameras, even if it "appears differently" to the two observers. And, you know that anything seen by a camera is located between that camera and its counterpart (though it may not be visible to the counterpart). What you don't know is how "thick" (along the vision axis) the object might be (e.g., a person vs. a vehicle). But, I don't see that knowledge adding enough value to warrant further complicating the design.
> Following a person, e.g., seen quarter-behind in one camera, and > tracking them to another camera that sees a side view - from the > /other/ side - > > Just following a person is easy, but tracking a specific person, > particularly when multiple people are present, gets very complicated > very quickly.
Yes. You need enough detail to be able to distinguish *easily* between candidates. You're not just "counting bodies". In the *home* environment, the actors are likely not malevolent; it's in their best interest for the system to know who/where they are. But, I don't think that's necessarily true in commercial and industrial environments. Even though it is similarly in THEIR best interests, I think the actors, there, are more likely to express hostility towards their overlords in that way.
On 12/31/2022 23:29, Don Y wrote:
> On 12/31/2022 1:13 PM, Dimiter_Popoff wrote: >> On 12/31/2022 20:16, Don Y wrote: >>> On 12/31/2022 4:15 AM, Dimiter_Popoff wrote: >>>>> Serial protocols inherently deliver data in a predefined pattern >>>>> (often intended for display).&nbsp; Scene analysis doesn't necessarily >>>>> conform to that same pattern. >>>> >>>> Isn't there a camera doing a protocol which allows you to request >>>> a specific area only to be transferred? RFB like, VNC does that >>>> all the time. >>> >>> That only makes sense if you know, a priori, which part(s) of the >>> image you might want to examine.&nbsp; E.g., it would work for >>> "exposing" just the portion of the field that "overlaps" some >>> other image.&nbsp; I can get fixed parts of partial frames from >>> *other* cameras just by ensuring the other camera puts that >>> portion of the image in a particular memory object and then >>> export that memory object to the node that wants it. >>> >>> But, if a target can move into or out of the exposed area, then >>> you have to make a return trip to the camera to request MORE of >>> the field. >>> >>> When your targets are "far away" (like a surveillance camera >>> monitoring a parking lot), targets don't move from their >>> previous noted positions considerably from one frame to the >>> next. >>> >>> But, when the camera and targets are in close proximity, >>> there's greater (apparent) relative motion in the same >>> frame-interval.&nbsp; So, knowing where (x,y+WxH)) the portion of >>> the image of interest lay, previously, is less predictive >>> of where it may lie currently. >>> >>> Having the entire image available means the software >>> can look <wherever> and <whenever>. >> >> Well yes, obviously so, but this is valid whatever the interface. >> Direct access to the sensor cells can't be double buffered so >> you will have to transfer anyway to get the frame you are analyzing >> static. > > I would assume the devices would have evolved an "internal buffer" > (as I said, my experience with *DRAM* in this manner was 40+ years > ago) > >> Perhaps you could find a way to make yourself some camera module >> using an existing one, MIPI or even USB, since you are looking for low >> overall cost; and add some MCU board to it to do the buffering and >> transfer areas on request. Or may be put enough CPU power together with >> each camera to do most if not all of the analysis... Depending on >> which achieves the lowest cost. But I can't say much on cost, that's >> pretty far from me (as you know). > > My current approach gives me that -- MIPS, size, etc.&nbsp; But, the cost > of transferring parts of the image (without adding a specific mechanism) > is a "shared page" (DSM).&nbsp; So, host (on node A) references part of > node *B*s frame buffer and the page (on B) containing that memory > address gets shipped back to node A and mapped into A's memory.
I assume A and B are connected over Ethernet via tcp/ip? Or are they just two cores on the same chip or something?
> > .... > > But, transport delays make this unsuitable for real-time work; > a megabyte of imagery would require 100ms to transfer, in "raw" > form.&nbsp; (I could encode it on the originating host; transfer it > and then decode it on the receiving host -- at the expense of MIPS. > This is how I "record" video without saturating the network)
100 ms latency would be an issue if you face say A-Train (for you and the rest who have not watched "The Boys" - he is a super (a "sup" as they have it) who can run fast enough to not be seen by normal humans...) :-).
> > So, you (B) want to "abstract" the salient features of the image > while it is on B and then transfer just those to A.&nbsp; *Use* > them, on A, and then move on to the next set of features > (that B has computed while A was busy chewing on the last set) > > Or, give A direct access to the native data (without A having > to capture video streams from each of the cameras that it wants > to potentially examine) >
In RFB, the server can - and should - decide which parts of the framebuffer have changed and send across only them, which works fine for computer-generated images - plenty of single colour areas, no noise etc. In your case you might have to resort to JPEGing the image, downgrading its quality so "small" changes would disappear; I think those who write video encoders do something like that (for my vnc server lossless RLE was plenty, but it is not very efficient when the screen is some real-life photo, obviously).
On 1/1/2023 7:04 AM, Dimiter_Popoff wrote:
>>> Perhaps you could find a way to make yourself some camera module >>> using an existing one, MIPI or even USB, since you are looking for low >>> overall cost; and add some MCU board to it to do the buffering and >>> transfer areas on request. Or may be put enough CPU power together with >>> each camera to do most if not all of the analysis... Depending on >>> which achieves the lowest cost. But I can't say much on cost, that's >>> pretty far from me (as you know). >> >> My current approach gives me that -- MIPS, size, etc.&nbsp; But, the cost >> of transferring parts of the image (without adding a specific mechanism) >> is a "shared page" (DSM).&nbsp; So, host (on node A) references part of >> node *B*s frame buffer and the page (on B) containing that memory >> address gets shipped back to node A and mapped into A's memory. > > I assume A and B are connected over Ethernet via tcp/ip? Or are they > just two cores on the same chip or something?
If A had direct access to the camera on B, then they'd be the same node, right? :>

A maps a memory object that has been defined on B into A's memory space at a particular address (range). So, A *pretends* it has a local copy of the frame buffer. When A references ANY address in that range, a fault causes a request to be sent to B for a copy of the datum referenced. Of course, it would be silly to just return that one value; any subsequent references would also have to cause a page fault (because you couldn't allow A to reference an adjacent address for which data is not yet available, locally). So, B ships a copy of the entire page over to A and A instantiates that copy as a local page marked as "present" (so, subsequent references incur no faults).

The application has fine-grained control over the *policy* that is used, here. So, if he knows that an access to address N will then be followed by N+3000r16, he can arrange for the page containing N *and* N+3000r16 to both be shipped over (so there isn't a fault triggered when the N+3000r16 reference occurs).

[There are also provisions that allow multiple *writers* to shared regions so the memory behaves, functionally (but not temporally!) like LOCAL "shared memory"]

But, that's a shitload of overhead if you want to treat the remote frame buffer AS IF it was local.
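(A minimal, self-contained sketch of that demand-paged access pattern -- the page size, frame geometry and every name below are assumptions for illustration only, and the actual request to node B is stubbed out with a fake fill:)

/* Sketch of the DSM idea: node A treats B's frame buffer as if it were
 * local; the first touch of a page fetches the whole page from B, after
 * which accesses are purely local.  The network/RPC layer is faked.    */
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>

#define PAGE_SIZE   4096u
#define FRAME_BYTES (640u * 480u)            /* assumed mono VGA frame   */
#define NPAGES      ((FRAME_BYTES + PAGE_SIZE - 1) / PAGE_SIZE)

static uint8_t local_copy[NPAGES][PAGE_SIZE];   /* A's view of B's buffer */
static bool    present[NPAGES];                 /* "page is resident" bits */

/* Stand-in for the real request to node B; a real system would ship the
 * page over the network here.                                           */
static void fetch_remote_page(unsigned page)
{
    memset(local_copy[page], (int)(page & 0xFF), PAGE_SIZE);
    present[page] = true;
}

/* Read one pixel of the remote frame, faulting the containing page in on
 * first touch.  'prefetch_offset' mimics the policy hint: also pull in
 * the page we know will be needed next, so it doesn't fault later.      */
static uint8_t remote_read(uint32_t offset, uint32_t prefetch_offset)
{
    unsigned page = offset / PAGE_SIZE;
    if (!present[page])
        fetch_remote_page(page);

    unsigned pre = prefetch_offset / PAGE_SIZE;
    if (pre < NPAGES && !present[pre])
        fetch_remote_page(pre);

    return local_copy[page][offset % PAGE_SIZE];
}

int main(void)
{
    /* Touch pixel (x=17, y=200) and hint that (x=17, y=201) follows. */
    uint32_t off = 200u * 640u + 17u;
    printf("pixel = %u\n", remote_read(off, off + 640u));
    return 0;
}

The prefetch argument stands in for the policy hint described above: ship the page you know you'll touch next along with the faulting one, so the second reference doesn't stall.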
>> But, transport delays make this unsuitable for real-time work; >> a megabyte of imagery would require 100ms to transfer, in "raw" >> form.&nbsp; (I could encode it on the originating host; transfer it >> and then decode it on the receiving host -- at the expense of MIPS. >> This is how I "record" video without saturating the network) > > 100 ms latency would be an issue if you face say A-Train > (for you and the rest who have not watched "The Boys" - he is a > super (a "sup" as they have it) who can run fast enough to not > be seen by normal humans...) :-).
The bigger problem is throughput. You don't care if all of your references are skewed 100ms in time; add enough buffering to ensure every frame remains available for that full 100ms and just expect the results to be "late".

The problem happens when there's another frame coming before you've finished processing the current frame. And so on.

So, while it is "slick" and eliminates a lot of explicit remote access code being exposed to the algorithm (e.g., "get me location X,Y of the remote frame buffer"), it's just not practical for the application.
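(For scale, a back-of-envelope check of that 100 ms figure -- the numbers below are illustrative assumptions, not measurements: 1 MB in 100 ms is roughly 100 Mbit/s on the wire, which caps raw transfer at about 10 frames/second before any processing happens at all.)

/* Toy frame-rate ceiling from raw link bandwidth and frame size.       */
#include <stdio.h>

int main(void)
{
    double link_bps    = 100e6;          /* assumed ~100 Mbit/s effective  */
    double frame_bytes = 1.0e6;          /* ~1 MB raw frame                */
    double xfer_s      = frame_bytes * 8.0 / link_bps;

    printf("transfer time : %.0f ms per frame\n", xfer_s * 1e3);
    printf("ceiling       : %.1f frames/s before any processing\n", 1.0 / xfer_s);
    return 0;
}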
>> So, you (B) want to "abstract" the salient features of the image
>> while it is on B and then transfer just those to A.  *Use*
>> them, on A, and then move on to the next set of features
>> (that B has computed while A was busy chewing on the last set)
>>
>> Or, give A direct access to the native data (without A having
>> to capture video streams from each of the cameras that it wants
>> to potentially examine)
Yes, but if the receiving end has no interest in those areas of the image, then you're just wasting effort (bandwidth) transferring them -- esp if the areas of interest will need that bandwidth!
> no noise etc.  In your case you might have to resort to jpeg
> the image downgrading its quality so "small" changes would
> disappear, I think those who write video encoders do something
> like that (for my vnc server lossless RLE was plenty, but it it
> is not very efficient when the screen is some real life photo,
> obviously).
I think the solution is to share abstractions. Design the algorithms so they can address partial "objects of interest" and report on those. Then, coordinate those partial results to come up with a unified concept of what's happening in the observed scene.

But, this is a fair bit harder than just trying to look at a unified frame buffer and detect objects/motion!

OTOH, if it was easy, it would be boring ("What's to be learned from doing something that's already been done?")
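(One plausible shape for such a shared abstraction -- the field names and sizes here are invented purely for illustration, not anything from the actual system:)

/* A "partial detection" record: what one camera node might report to a
 * peer instead of shipping pixels, expressed so the peer can stitch its
 * own partial view of the same object onto it.                          */
#include <stdint.h>

typedef struct {
    uint32_t camera_id;      /* which node produced this report           */
    uint64_t frame_time_us;  /* capture timestamp, for correlating frames */
    uint32_t track_id;       /* node-local identifier for the object      */
    int32_t  x, y, w, h;     /* bounding box in this camera's image plane */
    uint8_t  clipped_edges;  /* bit flags: object runs off L/R/T/B edge   */
    uint8_t  kind;           /* coarse class: person, vehicle, unknown... */
    uint8_t  confidence;     /* 0..255                                    */
} partial_detection_t;

A node could publish a handful of these per frame; a peer that sees the clipped_edges flags set knows to look for the rest of the object ("Bob's legs") in its own field of view.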
On 1/1/2023 23:28, Don Y wrote:
> On 1/1/2023 7:04 AM, Dimiter_Popoff wrote: >> ....
>....
>> In RFB, the server can - and should - decide which parts of the
>> framebuffer have changed and send across only them. Which works
>> fine for computer generated images - plenty of single colour areas,
>
> Yes, but if the receiving end has no interest in those areas
> of the image, then you're just wasting effort (bandwidth)
> transfering them -- esp if the areas of interest will need
> that bandwidth!
But nothing is stopping the receiving end from requesting a particular area, and the sending side from sending just the changed parts of it. I am not suggesting you use RFB; I use it just as an example.
>> no noise etc.  In your case you might have to resort to jpeg
>> the image downgrading its quality so "small" changes would
>> disappear, I think those who write video encoders do something
>> like that (for my vnc server lossless RLE was plenty, but it it
>> is not very efficient when the screen is some real life photo,
>> obviously).
>
> I think the solution is to share abstractions.  Design the
> algorithms so they can address partial "objects of interest"
> and report on those.  Then, coordinate those partial results
> to come up with a unified concept of what's happening in
> the observed scene.
Well, I think this is the way to go, too. This implies enough CPU horsepower per camera, which nowadays might be practical.
> But, this is a fair bit harder than just trying to look at > a unified frame buffer and detect objects/motion!
Well yes but you lose the framebuffer transfer problem, no need to do your "remote virtual machine" for that etc.
> > OTOH, if it was easy, it would be boring ("What's to be learned > from doing something that's already been done?") >
Not only that; if it were easy everyone else would be doing it :-).
On Sun, 1 Jan 2023 14:28:20 -0700, Don Y <blockedofcourse@foo.invalid>
wrote:

>On 1/1/2023 7:04 AM, Dimiter_Popoff wrote: > >The bigger problem is throughput. You don't care if all of your >references are skewed 100ms in time; add enough buffering to >ensure every frame remains available for that full 100ms and >just expect the results to be "late". > >The problem happens when there's another frame coming before >you've finished processing the current frame. And so on. > >So, while it is "slick" and eliminates a lot of explicit remote >access code being exposed to the algorithm (e.g., "get me location >X,Y of the remote frame buffer"), it's just not practical for the >application.
All cameras have a free-run "demand" mode in which (between resets) the CCD is always accumulating - waiting to be read out. But many also have a mode in which they do nothing until commanded. In any event, without a command the controller will just service the CCD - it won't transfer the image anywhere unless asked.

Many "smart" cameras can do ersatz stream compression by double buffering internally and performing image subtraction to remove unchanging (to some threshold) images. In a motion-activated environment this can greatly cut down on the number of images YOU have to process.

Better ones also offer a suite of onboard image processing functions: motion detection, contrast expansion, thresholding, line finding ... now some even offer pattern/object recognition. If the functions they provide are useful, it can pay to take advantage of them.

I know you are (thinking of) designing your own ... you should maybe think hard about what smarts you want onboard.
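(A minimal sketch of that image-subtraction gating, with made-up frame size and thresholds -- not any particular camera's onboard implementation:)

/* Compare the new frame against the previous one and only wake the host
 * when enough pixels have changed by enough.                            */
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>

#define W 320
#define H 240
#define PIX_DELTA 12        /* per-pixel change considered "real"          */
#define PIX_COUNT 500       /* how many changed pixels constitute motion   */

static bool frame_has_motion(const uint8_t *prev, const uint8_t *cur)
{
    size_t changed = 0;
    for (size_t i = 0; i < (size_t)W * H; i++) {
        int d = (int)cur[i] - (int)prev[i];
        if (d < 0) d = -d;
        if (d > PIX_DELTA && ++changed >= PIX_COUNT)
            return true;                 /* early out: enough evidence     */
    }
    return false;
}

int main(void)
{
    static uint8_t a[W * H], b[W * H];
    memset(a, 10, sizeof a);
    memcpy(b, a, sizeof b);
    memset(b + 1000, 200, 600);          /* fake a moving blob             */
    printf("motion: %s\n", frame_has_motion(a, b) ? "yes" : "no");
    return 0;
}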
>> In RFB, the server can - and should - decide which parts of the
>> framebuffer have changed and send across only them. Which works
>> fine for computer generated images - plenty of single colour areas,
>
>Yes, but if the receiving end has no interest in those areas
>of the image, then you're just wasting effort (bandwidth)
>transfering them -- esp if the areas of interest will need
>that bandwidth!
That's true, but protocols like VNC's "copyrect" encoding essentially divide the image into a large checkerboard and transfer only those "squares" where the underlying image has changed. What is considered a "change" could be further limited on the sending side by pre-processing: erosion and/or thresholding.

The biggest problem always is how much extra buffering you need for as-yet-unprocessed images in the stream - while you're working on one thing, you can easily lose something else.
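(A rough illustration of that checkerboard idea, with arbitrary frame and tile sizes -- a simplified stand-in, not VNC's actual encoding:)

/* Carve the frame into fixed tiles and flag only those whose contents
 * differ from the previous frame; only flagged tiles need be sent.      */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

enum { W = 640, H = 480, TILE = 32, TX = W / TILE, TY = H / TILE };

static bool tile_changed(const uint8_t *prev, const uint8_t *cur,
                         int tx, int ty)
{
    for (int row = 0; row < TILE; row++) {
        const uint8_t *p = prev + (ty * TILE + row) * W + tx * TILE;
        const uint8_t *c = cur  + (ty * TILE + row) * W + tx * TILE;
        if (memcmp(p, c, TILE) != 0)
            return true;
    }
    return false;
}

/* Mark changed tiles; a sender would then transmit only those tiles
 * (plus their coordinates).                                             */
static int mark_changed(const uint8_t *prev, const uint8_t *cur,
                        bool dirty[TY][TX])
{
    int n = 0;
    for (int ty = 0; ty < TY; ty++)
        for (int tx = 0; tx < TX; tx++)
            if ((dirty[ty][tx] = tile_changed(prev, cur, tx, ty)))
                n++;
    return n;
}

int main(void)
{
    static uint8_t prev[W * H], cur[W * H];
    static bool dirty[TY][TX];
    cur[123 * W + 456] = 0xFF;                   /* one changed pixel      */
    printf("%d of %d tiles dirty\n", mark_changed(prev, cur, dirty), TX * TY);
    return 0;
}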
>> no noise etc.  In your case you might have to resort to jpeg
>> the image downgrading its quality so "small" changes would
>> disappear, I think those who write video encoders do something
>> like that (for my vnc server lossless RLE was plenty, but it it
>> is not very efficient when the screen is some real life photo,
>> obviously).
And RLE or copyrect can be combined further with lossless LZ compression. For really good results, wavelet compression is the best - it basically reduces the whole image to a set of equation coefficients, and you can preserve (or degrade) detail in the reconstructed image by altering how many coefficients are calculated from the original. But it is compute intensive: you really need a DSP or SIMD CPU to do it efficiently.
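(For concreteness, a toy byte-wise RLE of the sort such a lossless stage might use -- real copyrect/RLE formats differ; this is only a sketch:)

/* Encode src[0..n) as (count, value) pairs; returns bytes written.
 * dst must be able to hold 2*n bytes in the worst case.                 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static size_t rle_encode(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t out = 0, i = 0;
    while (i < n) {
        uint8_t v = src[i];
        size_t run = 1;
        while (i + run < n && src[i + run] == v && run < 255)
            run++;
        dst[out++] = (uint8_t)run;
        dst[out++] = v;
        i += run;
    }
    return out;
}

int main(void)
{
    uint8_t line[16] = {0,0,0,0,0,7,7,7,9,9,9,9,9,9,9,9};
    uint8_t enc[32];
    printf("16 bytes -> %zu bytes\n", rle_encode(line, 16, enc));
    return 0;
}

As the post notes, this only wins on flat, synthetic content; on noisy real-world imagery the runs collapse and something like wavelet coding earns its compute cost.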
>I think the solution is to share abstractions. Design the
>algorithms so they can address partial "objects of interest"
>and report on those. Then, coordinate those partial results
>to come up with a unified concept of what's happening in
>the observed scene.
>
>But, this is a fair bit harder than just trying to look at
>a unified frame buffer and detect objects/motion!
>
>OTOH, if it was easy, it would be boring ("What's to be learned
>from doing something that's already been done?")
As I said previously, smart cameras can do things like motion detection onboard, and report the AOI along with the image. George
On 1/1/2023 2:53 PM, Dimiter_Popoff wrote:
> On 1/1/2023 23:28, Don Y wrote: >> On 1/1/2023 7:04 AM, Dimiter_Popoff wrote: >>> .... > >.... >>> In RFB, the server can - and should - decide which parts of the >>> framebuffer have changed and send across only them. Which works >>> fine for computer generated images - plenty of single colour areas, >> >> Yes, but if the receiving end has no interest in those areas >> of the image, then you're just wasting effort (bandwidth) >> transfering them -- esp if the areas of interest will need >> that bandwidth! > > But nothing is stopping the receiving end to request a particular area > and the sending side sending just the changed parts of it. > I am not suggesting you use RFB, I use it just as an example.
I'm trying to hide the fact that there are bits of code (and I/O's) operating on different processors. I.e., a single processor would <somehow> have all of these images accessible to it. I'd like to maintain that illusion by hiding any transfers/mapping "under the surface" so the main algorithm can concentrate on the problem at hand, and not the implementation platform. It's like having virtual memory instead of forcing the application to drag in "overlays" due to hardware constraints on the address space. Or, having to push data out to disk when the amount of local memory is exceeded. These are just nuisances that interfere with the design of the algorithm. But, I may be able to use the shared memory mechanism as a way to "hint" to the OS as to which parts of the image are of interest to the remote node. Then, arrange for the pager to only send the differences over the wire -- counting on the local pager to instantiate a duplicate copy of the previous image (which is likely still available on that host). I.e., bastardize CoW for the purpose.
>>> no noise etc.  In your case you might have to resort to jpeg
>>> the image downgrading its quality so "small" changes would
>>> disappear, I think those who write video encoders do something
>>> like that (for my vnc server lossless RLE was plenty, but it it
>>> is not very efficient when the screen is some real life photo,
>>> obviously).
>>
>> I think the solution is to share abstractions.  Design the
>> algorithms so they can address partial "objects of interest"
>> and report on those.  Then, coordinate those partial results
>> to come up with a unified concept of what's happening in
>> the observed scene.
>
> Well I think this is the way to go, too. This implies enough
> CPU horsepowers per camera which nowadays might be practical.
I've got enough for a single camera. But, if I had to handle a multicamera *scene* (completely) with that processor, I'd be running out of MIPS.
>> But, this is a fair bit harder than just trying to look at >> a unified frame buffer and detect objects/motion! > > Well yes but you lose the framebuffer transfer problem, no > need to do your "remote virtual machine" for that etc.
The question will be where the effort pays off quickest. E.g., dropping the effective frame rate may make simpler solutions more practical.
>> OTOH, if it was easy, it would be boring ("What's to be learned >> from doing something that's already been done?") > > Not only that; if it were easy everyone else would be doing it :-).
I have no problem letting other people invent wheels that I can freely use. Much of my current architecture is pieced together from ideas gleaned over the past several decades (admittedly, on bigger iron than "MCUs"). It's only now that it is economically feasible for me to exploit some of these technologies.
On 1/1/2023 6:59 PM, George Neuner wrote:
> On Sun, 1 Jan 2023 14:28:20 -0700, Don Y <blockedofcourse@foo.invalid> > wrote: > >> On 1/1/2023 7:04 AM, Dimiter_Popoff wrote: >> >> The bigger problem is throughput. You don't care if all of your >> references are skewed 100ms in time; add enough buffering to >> ensure every frame remains available for that full 100ms and >> just expect the results to be "late". >> >> The problem happens when there's another frame coming before >> you've finished processing the current frame. And so on. >> >> So, while it is "slick" and eliminates a lot of explicit remote >> access code being exposed to the algorithm (e.g., "get me location >> X,Y of the remote frame buffer"), it's just not practical for the >> application. > > All cameras have a free-run "demand" mode in which (between resets) > the CCD is always accumulating - waiting to be read out. But many > also have a mode in they do nothing until commanded.
The implication in my comments was that you would want to target a certain frame rate as a performance metric. Whether that has to be the camera's nominal rate or something slower than that would likely depend on the scene being analyzed.
> In any event, without command the controller will just service the CCD > - it won't transfer the image anywhere unless asked. > > Many "smart" cameras can do ersatz stream compression by double > buffering internally and performing image subtraction to remove > unchanging (to some threshold) images. In a motion activated > environment this can greatly cut down on the number of images YOU have > to process. > > Better ones also offer a suite of onboard image processing functions: > motion detection, contrast expansion, thresholding, line finding ... > now even some offer pattern object recognition. If the functions they > provide are useful, it can pay to take advantage of them.
I don't yet know what will be useful. So far, my algorithms have been two-dimensional versions of photo-interrupters. I don't care what I'm seeing, just that I'm seeing it in a certain place under certain conditions. Visually tracking targets will be considerably harder. Previously, I required the targets to wear a beacon that I could locate wirelessly. This works because it gives the user a means of interacting with the system audibly without having to clutter the space with utterances (and sort out what's intentional and what is extraneous). But, that only makes sense for folks using "personal audio". Anyone without such a device would be invisible to the system. Switching to vision will (?) let me allow anyone in the arena to interact/participate. And, can potentially let nonverbal users interact without having to wear a "transducer".
> I know you are (thinking of) designing your own ... you should maybe > think hard about what smarts you want onboard.
Thinking hard is easy. Knowing WHAT to think about is hard!
>>> In RFB, the server can - and should - decide which parts of the
>>> framebuffer have changed and send across only them. Which works
>>> fine for computer generated images - plenty of single colour areas,
>>
>> Yes, but if the receiving end has no interest in those areas
>> of the image, then you're just wasting effort (bandwidth)
>> transfering them -- esp if the areas of interest will need
>> that bandwidth!
>
> That's true, but protocols like VNC's "copyrect" encoding essentially
> divide the image into a large checkerboard, and transfers only those
> "squares" where the underlying image has changed. What is considered a
> "change" could be further limited on the sending side by
> pre-processing: erosion and/or thresholding.
I would assume you could adaptively size regions so you looked at the cost of sending the contents AND size information for multiple smaller regions vs. just the contents (size implied) for larger ones -- which may only contain a small amount of deltas.
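(A toy cost model for that trade-off -- the per-rectangle header size and the example rectangles are invented for illustration:)

/* Compare sending k small changed rectangles, each with its own x,y,w,h
 * header, against sending one bounding rectangle that covers them all.  */
#include <stdio.h>

#define HDR_BYTES 8               /* assumed x,y,w,h header per rect      */

typedef struct { int x, y, w, h; } rect_t;

static long rect_payload(rect_t r) { return (long)r.w * r.h; }

int main(void)
{
    rect_t small[2] = { {10, 10, 16, 16}, {200, 180, 16, 16} };

    /* Option 1: each changed rect sent separately (more headers).       */
    long separate = 0;
    for (int i = 0; i < 2; i++)
        separate += HDR_BYTES + rect_payload(small[i]);

    /* Option 2: one bounding rect covering both (one header, but lots
     * of unchanged pixels ride along).                                   */
    rect_t bound = {10, 10, 206, 186};
    long merged = HDR_BYTES + rect_payload(bound);

    printf("separate: %ld bytes, merged: %ld bytes -> send %s\n",
           separate, merged, separate <= merged ? "separately" : "merged");
    return 0;
}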
> The biggest problem always is how much extra buffering you need for > as-yet-unprocessed images in the stream - while you're working on one > thing, you easily can lose something else.
Yes. The "easy" approach is to treat it as HRT and plan on processing every frame in a single frame time. Latency can be large-ish -- as long as throughput is guaranteed.
>>> no noise etc.  In your case you might have to resort to jpeg
>>> the image downgrading its quality so "small" changes would
>>> disappear, I think those who write video encoders do something
>>> like that (for my vnc server lossless RLE was plenty, but it it
>>> is not very efficient when the screen is some real life photo,
>>> obviously).
>
> And RLE or copyrect can be combined further with lossless LZ
> compression.
>
> For really good results, wavelet compression is the best - it
> basically reduces the whole image to a set of equation coefficients,
> and you can preserve (or degrade) detail in the reconstructed image by
> altering how many coefficients are calculated from the original.
>
> But it is compute intensive: you really need a DSP or SIMD CPU to do
> it efficiently.
Time spent compressing and decompressing equates with time on the wire, transferring UNcompressed data. There's a point at which it's probably smarter to just use more network bandwidth than waste MIPS trying to conserve it. My immediate concern is making a "wise" (not necessarily "optimal") HARDWARE implementation decision so I can have some boards cut. And, start planning the sort of capabilities/feature that I can develop with those facilities.
>> I think the solution is to share abstractions. Design the
>> algorithms so they can address partial "objects of interest"
>> and report on those. Then, coordinate those partial results
>> to come up with a unified concept of what's happening in
>> the observed scene.
>>
>> But, this is a fair bit harder than just trying to look at
>> a unified frame buffer and detect objects/motion!
>>
>> OTOH, if it was easy, it would be boring ("What's to be learned
>> from doing something that's already been done?")
>
> As I said previously, smart cameras can do things like motion
> detection onboard, and report the AOI along with the image.
