Classes:
Data:
This class will handle data needed by the CMS, provide storage
for modules, manage the database connection and maybe also
contain some caches. At the moment it only provides access to
the Validator.
The two predicates SH_Data_check_tag and SH_Data_check_attr are
wrappers to the appropriate methods of the validator. These are
needed, as there shouldn't be direct calls to the internal
structure of SH_Data.
The modifying methods are not exposed, as the validator
shouldn't be changed while others depend on it; this has to be
implemented later.
Data also contains a wrapper for the self-closing tag predicate.
Attr:
The structure SH_Attr implements an HTML Attribute.
For every function there is also a static method/function,
which can perform the same work, but doesn't rely on really
having a single struct Attr. This is useful for example in an
array to manipulate a single element.
@subsubsection NodeFragment - Attributes
The method SH_NodeFragment_get_attr provides a pointer to an Attr by
its index. Note that it points directly to the internal data instead
of copying the data to a new Attr, which would be unnecessary overhead
if only read access is needed. That's why it is also a const pointer.
If the user intends to modify it, a copy should be taken via
SH_Attr_copy.
Multiple insert methods allow either adding an existing Attr or
creating a new one implicitly. If the Attr is not already used
beforehand, it is more efficient to call the attr_new methods. Also, an
old Attr is freed after it was inserted, thus it can't be used
afterwards. This is necessary, as for efficiency reasons an array of
Attr is used directly, instead of the indirect approach of storing
pointers to Attr. This means that the contents of the Attr have to be
copied to the internal structure. If the old Attr were left unfreed,
there would be two Attrs, the original one and the implicit one,
referring to the same data, which would lead to data corruption at
best, or to undefined behaviour like a double free, which would be a
serious threat for a library which is to be used on a webserver. ...
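The ownership transfer described above can be sketched as follows. This
is only an illustration under assumptions: struct attr, attr_new,
attr_move_into and attr_clear are hypothetical names, not the SH_Attr
API. The array slot takes over the strings and the emptied shell is
freed, so no second owner of the same data remains.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for SH_Attr. */
struct attr
{
	char *name;
	char *value;
};

static char *
copy_string (const char *s)
{
	size_t n = strlen (s) + 1;
	char *copy = malloc (n);
	if (copy != NULL)
	{
		memcpy (copy, s, n);
	}
	return copy;
}

/* Allocate a standalone attr owning copies of name and value. */
struct attr *
attr_new (const char *name, const char *value)
{
	struct attr *attr = malloc (sizeof (*attr));
	if (attr == NULL)
	{
		return NULL;
	}
	attr->name = copy_string (name);
	attr->value = copy_string (value);
	if (attr->name == NULL || attr->value == NULL)
	{
		free (attr->name);
		free (attr->value);
		free (attr);
		return NULL;
	}
	return attr;
}

/* Move *attr into an array slot. The slot now owns name and value;
 * the shell is freed, so attr must not be used afterwards. */
void
attr_move_into (struct attr *slot, struct attr *attr)
{
	assert (slot != NULL && attr != NULL);
	*slot = *attr;
	free (attr);
}

/* Release the strings owned by an array slot. */
void
attr_clear (struct attr *slot)
{
	free (slot->name);
	free (slot->value);
}
```

Because the strings have exactly one owner at any time, they are
released exactly once, by attr_clear on the slot.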
For each of the two insert modes, there is a method to prepend, append
or insert at a specific position. An incorrect position is handled
inside of the external method and an E_VALUE is thrown. The internal
method doesn't handle this, so special care must be taken not to invoke
undefined behaviour. However, enforcing this check would be unnecessary
overhead for the prepend and append methods, which are known to have
correct indices, as well as for other internal methods where the
internal method may be used.
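The split between the checked external method and the unchecked
internal one could look roughly like this. It is a minimal sketch over
a fixed int array; struct list, insert_internal, insert_checked and the
status enum are hypothetical names chosen for illustration, and the
library reports E_VALUE through its own error mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CAPACITY 8

/* Hypothetical status codes for this sketch only. */
enum status { OK, E_VALUE };

struct list
{
	int items[CAPACITY];
	size_t n;
};

/* Internal method: performs no range check. The caller guarantees
 * index <= list->n and that there is room for one more item. */
static void
insert_internal (struct list *list, int item, size_t index)
{
	assert (index <= list->n && list->n < CAPACITY);
	memmove (&list->items[index + 1], &list->items[index],
	         (list->n - index) * sizeof (list->items[0]));
	list->items[index] = item;
	list->n++;
}

/* prepend and append always pass a valid index, so they skip the
 * check and call the internal method directly. */
void
prepend (struct list *list, int item)
{
	insert_internal (list, item, 0);
}

void
append (struct list *list, int item)
{
	insert_internal (list, item, list->n);
}

/* External method: validates the caller-supplied index first. */
enum status
insert_checked (struct list *list, int item, size_t index)
{
	if (index > list->n)
	{
		return E_VALUE;
	}
	insert_internal (list, item, index);
	return OK;
}
```

Only insert_checked pays for the comparison; on a bad index it leaves
the list untouched.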
Two alternatives are provided for removal: remove_attr and pop_attr.
While the former frees the Attr's data, the latter allocates a new Attr
to store and return the data. Both are implemented by a single
(internal) static method.
@subsubsection Childs
A Fragment can contain children. When building the html, the
children's html is generated where appropriate.
The methods
- SH_Fragment_get_child (by index)
- SH_Fragment_is_child (non recursive) and
- SH_Fragment_is_descendant (recursive)
were added.
A Fragment can be copied, either recursively (also copying all
children) or nonrecursively (ignoring the children, thus the
copy never has children).
Adding the same element twice to the tree (graph) isn't
possible, as this would lead to problems, e.g. a double free or
similar quirks.
The single method (formerly SH_NodeFragment_append_child) to add a child
at the end of the child list was replaced by a set of methods to
insert a child at the beginning (SH_NodeFragment_prepend_child), at the
end (SH_NodeFragment_append_child), at a specific position
(SH_NodeFragment_insert_child) and directly before
(SH_NodeFragment_insert_child_before) or after another child
(SH_NodeFragment_insert_child_after). All these methods are implemented
by a single internal one (insert_child), as there isn't really much
difference in inserting one or the other way.
But this internal method doesn't check whether the insertion request is
actually doable, to save overhead, as not every insertion method
requires this check. The check is done by the respective method.
However, if the check is not done correctly, the internal method will
attempt to write to unallocated space, which will hopefully result in a
segfault.
The child list is implemented as an array. To reduce the overhead of
realloc calls, the array is allocated in chunks of children. The
calculation of how many have to be allocated is done by another static
method and determined by the macro CHILD_CHUNK. This is set to 5, which
is just a guess. It should be somewhere around the average number of
children per html element, to reduce unused overhead.
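The chunk calculation might look like the following sketch. The name
get_alloc_size is an assumption; only CHILD_CHUNK and its value 5 come
from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHILD_CHUNK 5

/* Number of slots to allocate for child_n children: the smallest
 * multiple of CHILD_CHUNK that is not less than child_n. */
size_t
get_alloc_size (size_t child_n)
{
	size_t chunks;

	/* guard against overflow in the multiplication below */
	assert (child_n <= SIZE_MAX - CHILD_CHUNK);

	chunks = child_n / CHILD_CHUNK;
	if (child_n % CHILD_CHUNK != 0)
	{
		chunks++;
	}
	return chunks * CHILD_CHUNK;
}
```

A realloc is then only needed when the result for the new child count
differs from the result for the old one, i.e. on every fifth insertion
at most.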
Also some predicates (SH_NodeFragment_is_parent,
SH_NodeFragment_is_ancestor) were added to check whether a relationship
exists between two nodes, thus whether they are linked through one or
multiple levels. These functions could replace the old ones
(SH_NodeFragment_is_child, SH_NodeFragment_is_descendant) semantically.
Furthermore, they are more efficient, as the check can now follow the
parent pointer. The internal insert method also uses these methods to
check whether the child node is actually a parent of the parent node,
which would result in errors later on.
The old test is now obsolete but was kept, as it does no harm to test
more.
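Checking over the parent pointer can be sketched as below. struct node
and the three function names are hypothetical simplifications; the real
predicates operate on SH_NodeFragment. The last function shows the kind
of check an insert method could do, assuming that both a direct parent
and a more distant ancestor have to be rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node, reduced to the parent pointer. */
struct node
{
	struct node *parent; /* NULL for a root */
};

/* true if `parent` is the direct parent of `node` */
bool
is_parent (const struct node *node, const struct node *parent)
{
	assert (node != NULL);
	return node->parent != NULL && node->parent == parent;
}

/* true if `ancestor` is reached from `node` by following the parent
 * pointers over one or more levels */
bool
is_ancestor (const struct node *node, const struct node *ancestor)
{
	const struct node *cur;

	assert (node != NULL);
	for (cur = node->parent; cur != NULL; cur = cur->parent)
	{
		if (cur == ancestor)
		{
			return true;
		}
	}
	return false;
}

/* Inserting `child` below `parent` would close a cycle if `child` is
 * `parent` itself or one of its ancestors. */
bool
would_create_cycle (const struct node *parent, const struct node *child)
{
	return parent == child || is_ancestor (parent, child);
}
```

The walk costs one step per level above the node, whereas a check from
the other side has to search through child lists.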
Various remove methods were added, which are all implemented by a
static method, analogous to the insert methods.
A Fragment can output its html. If there is an error the method
aborts and returns NULL.
This method also pays attention to self-closing tags, which is
determined via the validator.
When the wrap mode is used, a newline is started after each tag.
Also the html is indented, which can be configured by the
parameters indent_base, indent_step and indent_char. The
parameter indent_base specifies the width the first tag should
be indented by, while indent_step specifies the increment of
the indent when switching to a child tag. The character that is
used for indenting is taken from indent_char. (It could also be
a string longer than a single character.)
These arguments can't be set by the user, but are hardcoded
(for now).
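How the three parameters combine can be illustrated with a small
sketch. The function make_indent and the depth parameter are
assumptions, as is measuring the width in repetitions of indent_char;
the actual to_html method may build the indent differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build the indent for a tag at nesting level `depth`: indent_char
 * repeated (indent_base + depth * indent_step) times. The caller
 * frees the result. Returns NULL if the allocation fails. */
char *
make_indent (size_t indent_base, size_t indent_step,
             const char *indent_char, size_t depth)
{
	size_t count = indent_base + depth * indent_step;
	size_t unit;
	char *indent;
	size_t i;

	assert (indent_char != NULL);
	unit = strlen (indent_char);

	indent = malloc (count * unit + 1);
	if (indent == NULL)
	{
		return NULL;
	}
	for (i = 0; i < count; i++)
	{
		memcpy (indent + i * unit, indent_char, unit);
	}
	indent[count * unit] = '\0';
	return indent;
}
```

Since indent_char is passed as a string, a tab or a multi-character
sequence works the same way as a single space.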
The to_html method also generates the html for the attributes.
Note that the quotes the values are wrapped with are not
escaped inside the values. But this is also somewhat consistent,
as there is no syntax validation on the tags either
(i.e. a '<' inside of a tag is not rejected).
The TextFragment is used to implement the text between and
outside html tags. Currently, it is still very rudimentary in
that it doesn't support any operations at all and just has a
function to expose an internal text.
While this function is necessary to manipulate the content of a
TextFragment, the TextFragment should abstract the semantics of
Text. While simple wrapper functions for appending are to be
added, methods purely manipulating the text, i.e. relying on
the text's contents, won't get wrapper functions. Thus this
function is still needed until a more sophisticated approach is
implemented.
Some basic text functionality is already supported via wrapper
functions.
Note that wrapper functions aren't tested in unit tests.
When a newline is encountered in the text, a <br /> is inserted
and for wrap mode also a newline and an indent is inserted.
Note, that the indent is still missing at the front where it
can't be inserted yet as SH_Text is still lacking basic
functionality.
The html generation for both TextFragment and NodeFragment
combined is tested. As the encoding semantics of the
TextFragments are neither defined nor implemented, some tests
are marked as XFAIL.
What is still missing is the proper treatment of embedded text.
This should be indented and broken at 72/79/80 columns. Also
newlines and special chars should be replaced on generation,
maybe also giving some way of preventing XSS. Regarding the
NodeFragment, there should be some adjustments to further adjust
the styling, which of course should also be reflected by
TextFragment. This should also include the generation of
self-closing tags.
Furthermore, the html generation should be based on a single
text object which is appended to. This will later on also
make it possible to directly send generated parts over the
network while still generating some data.
Validator:
Validator serves as a syntax checker, i.e. it can be asked
whether a tag is allowed.
On initialization (of data), the Validator's knowledge is filled
with some common tags. This is of course to be replaced later,
by some dynamic handling.
When a tag is made known to the Validator, which it already
knows, the old id is returned and nothing is added.
The Validator saves the tags as an array. Now it additionally records
which slots are currently unused, to spare expensive calls to realloc.
This led to more or less a reimplementation of the functions. Tags
can't be deleted for now, but the adding function supports reusing
empty slots. Also the reading functions have to determine whether a
slot can be read or is empty.
The tests were adjusted, but are buggy, so they should be rewritten in
the future.
A registered tag can be deregistered by calling SH_Validator_deregister.
The data is removed, but the space is not deallocated, if it is not at
the end. This prevents copying data on removal and saves expensive calls
to realloc. Instead the empty space is added to the list of free blocks,
which allows these spaces to be refilled when a new tag is registered.
The space is finally deallocated when the validator is deallocated or
the tag written in the last block is removed. In this case, heavy
iteration is performed, as the list of free blocks is not ordered. The
new last tag is determined by walking back from the end and searching
the list of free blocks for each block, until a block is not found in
it.
Note that even if there can be a lot of gaps in between, the Validator
will not allocate more space until all these gaps are refilled when a
new tag is registered, thus new space is only being allocated, if there
is really not enough space left.
Due to the 4 nested loops, there was an issue related to the
72(80)-column rule. It can't be obeyed without severely impacting the
readability of the code.
Originally the ids were intended to be useful for linking different
information together internally, and for providing references
externally. However, they weren't used internally; for this, pointers
seemed to be more useful, as they also allow direct access to the data
and also have a relation defined.
Regarding reference purposes, they aren't really needed, and it is more
convenient to directly use some strings. They also aren't more
performant, as there still have to be internal checks, and looking for
an int isn't more performant than looking for a pointer.
Also, they have to be stored, so they need more memory and also some
code to be handled.
While it was very clever, the complex data structure of the tag array
introduced in 'Validator: restructured internal data (a0c9bb2)' comes
with a lot of runtime overhead. It reduces the calls to free and
realloc, when a lot of tags are deleted and inserted subsequently, but
burdens each call with a loop over the linked list of free blocks.
This is even more important, as the validator must be fast in checking,
as this is done every time something is inserted into the DOM-tree, but
has less tight requirements for registering new tags, as this is mostly
done at startup time.
As the access must be fast, the tags are sorted when inserted, so that
the search can take place in logarithmic time.
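A lookup in logarithmic time over a sorted tag array can be sketched
with the standard bsearch function. This assumes that the tags are
plain strings kept in strcmp order; check_tag and compare_tags are
hypothetical names and not the Validator's real internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* bsearch passes the key first and an array element second. */
static int
compare_tags (const void *key, const void *element)
{
	return strcmp ((const char *) key,
	               *(const char *const *) element);
}

/* Return whether `tag` is contained in `tags`, which must hold
 * tag_n strings sorted in strcmp order. */
bool
check_tag (const char *const *tags, size_t tag_n, const char *tag)
{
	assert (tags != NULL && tag != NULL);
	return bsearch (tag, tags, tag_n, sizeof (*tags), compare_tags)
	       != NULL;
}
```

Each lookup needs about log2(tag_n) string comparisons, which only
holds as long as every insertion keeps the array sorted.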
There is a method to add a set of tags to a validator on initialisation.
First, this relieves a user application of the burden of maintaining the
html spec; second, it is more performant, as a lot of tags are inserted
at once, so there aren't multiple allocation calls.
As the validator needs the tags to be in order, the tags must be sorted
on insertion. Of course it would be easier for the code if the tags
were already in order, but first, a mistake could easily be made, and
second, sorting the tags by an algorithm allows the tags to be specified
in a logically grouped and thus more maintainable order.
For the sorting, insertion sort is used. Of course it has a worse,
quadratic time complexity, but in a constructor, I wouldn't introduce
the overhead of memory management a heap- or mergesort would introduce,
and in-place sorting is also out, because the data lies in read-only
memory. Thus I chose an algorithm with constant space complexity. Also
the 'long' running time is not so important, as the initialization only
runs once at startup and the tags are not likely to exceed a few
hundred, so even a quadratic time isn't that bad.
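Sorting from a read-only source into the destination array might be
done as in the following sketch. sort_into is a hypothetical name and
the tags are simplified to plain strings; the point is only that each
source entry is inserted at its sorted position in the destination, so
no memory besides the destination is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the n strings of the read-only array `src` into `dest` in
 * strcmp order. `dest` must have room for n pointers. Uses no
 * memory besides `dest`, at the cost of O(n^2) comparisons. */
void
sort_into (const char **dest, const char *const *src, size_t n)
{
	size_t i, j;

	assert (n == 0 || (dest != NULL && src != NULL));

	for (i = 0; i < n; i++)
	{
		/* dest[0..i) is already sorted; shift every entry that
		 * is larger than src[i] up by one slot */
		for (j = i; j > 0 && strcmp (dest[j - 1], src[i]) > 0; j--)
		{
			dest[j] = dest[j - 1];
		}
		dest[j] = src[i];
	}
}
```

The source array is never written to, so it can stay in read-only
memory and keep its logically grouped order.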
Each tag has a type as defined by the html spec. This must be provided
on registration. Implicitly registering tags, when an attribute is
registered can't be done anymore, as the type information would be
missing.
The added parameter in register_tag, as well as the change of behaviour
in register_attr, has broken a lot of tests, which had to be adjusted
accordingly.
Added self-closing predicate. Other predicates may follow.
The Validator already contains all HTML5 tags.
Tags according to:
https://html.spec.whatwg.org/dev/indices.html#elements-3
Types according to:
https://html.spec.whatwg.org/multipage/syntax.html#elements-2
Retrieved 04. 10. 2023
An attribute can be deregistered by calling SH_Validator_deregister_attr.
Note that deregistering an attr, that was never registered is considered
an error, but this may change, as technically it is not registered
afterwards and sometimes (i.e. for a blacklist) it might be preferable
to ensure, that a specific attr is not registered, but it is not clear
whether there should be an error or not.
Also the deallocating of the data used for an attr was moved to an extra
method, as this is needed in several locations and it might be subject
to change.
The Validator can check if an attribute is allowed in a tag. It does so
by associating allowed tags with attributes. This is done that way to
also support attributes which are allowed for every tag (global
attributes), but this is not yet supported. So some functions allow
NULL to be passed and some will still crash.
The predicate SH_Validator_check_attr returns whether an attribute is
allowed for a specific tag. If tag is NULL, it returns whether an attr
is allowed at all, not whether it is allowed for every tag. For this
another predicate will be provided, when this is to be implemented.
The method SH_Validator_register_attr registers a tag-attr combination.
Note, that it will automatically call SH_Validator_register_tag, if the
tag doesn't exist. Later it will be possible, to set tag to NULL to
register a global attribute, but for now the method will crash.
The method SH_Validator_deregister_attr removes a tag-attr combination
registered earlier. Note, that deregistering a non existent combination
will result in an error. This behaviour is arguable and might be subject
to change. When setting only tag to NULL, all tags for this attribute
are deregistered. When setting only attr to NULL, all attrs for this tag
are deregistered. This might suffer from problems, if this involves some
attrs, that are global. Also this will use the internal method
remove_tag_for_all_attrs, which has the problem, that it might fail
partially. Normally, when failing, all functions revert the program to
the same state as it was in before the call. This function, however, is
different: if it fails, there might be some combinations that haven't
been removed, while others already have been. Nevertheless, the
validator is still in a valid state, so it is possible to call this
function a second time, but it is not known which combinations are
already deregistered.
As the attrs also use the internal strings of the tags, it must be
ensured, when a tag is deregistered, that all remaining references are
removed, otherwise there would be dangling pointers. Note, that for this
also remove_tag_for_all_attrs is used, so the method
SH_Validator_deregister_tag suffers from the same problems listed above.
Also if this internal method fails, the tag won't be removed at all.
Similar to the tags, the attributes can be initialized. Missing tags are
automatically added. The declaration syntax is currently a bit annoying,
as the tags that belong to an attribute either have to be declared
explicitly or a pointer to the tag declaration must be given, but then
only consecutive tags are possible.
Support for global attributes is likewise missing; it must be ensured
that (tag_n != 0) && (tags != NULL). Otherwise the validator will be
inconsistent and there might be a bug.
Global attributes are represented by empty attributes. A global
attribute is an attribute, that is accepted for any tag.
Removing a specific tag from a global attribute is refused, as this
would mean to "localize" the attribute, thus making it not global
anymore. The method to do that and a predicate for globalness are still
missing.
Deregistering a global attribute normally is not possible, as basically
every other tag has to be added. This is implemented now.
Originally it was intended to provide the caller with the information,
that a global attribute has to be converted into a local one before
removal. However such internals should not be exposed to the caller. As
it stands there is no real reason to inform a caller, whether an
attribute is local or global. Also, there is the problem that the
predicate is burdened with the possibility that the attribute doesn't
exist, thus it can't return a boolean directly. For both reasons, the
predicate hasn't been added yet.
Also a bug was detected in the method remove_tag_for_all_attrs. It
removes an attribute while also iterating over it, thus potentially
skipping over some attribute and maybe also invoking undefined behaviour
by deallocating space after the array.
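The class of bug described here, and one common way around it, can be
shown on a plain int array. remove_all is a hypothetical helper and not
the fix applied in remove_tag_for_all_attrs; it only illustrates that
the index must not advance after an element was removed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove every occurrence of `value` from items[0..*n). Returns the
 * number of removed elements and updates *n. */
size_t
remove_all (int *items, size_t *n, int value)
{
	size_t removed = 0;
	size_t i = 0;

	assert (items != NULL && n != NULL);

	while (i < *n)
	{
		if (items[i] == value)
		{
			/* close the gap; the next element now sits at
			 * index i, so i must NOT be incremented here */
			memmove (&items[i], &items[i + 1],
			         (*n - i - 1) * sizeof (*items));
			(*n)--;
			removed++;
		}
		else
		{
			i++;
		}
	}
	return removed;
}
```

A loop that always increments i would skip the second of two adjacent
matches, and a loop bounded by the original length would read past the
shrunken array.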
Copying a Validator could be useful if multiple html versions are to be
supported. Another use case is a blacklist XSS-Scanner.
Text:
This is a data type to deal with frequently appending to a string.
The space a Text has for saving the string is allocated in chunks.
To request additional space SH_Text_enlarge is called. If the
requested size fits inside the already allocated space or is even
smaller than the current size, nothing is done. Otherwise a
multiple of the chunk size is allocated that is equal to or
greater than the requested size. The chunk size can be changed by changing
the macro CHUNK_SIZE in src/text.h. The default is 64.
The adjustment is done automatically when a string is added.
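The enlarge logic can be sketched over a flat buffer as follows. struct
text, text_enlarge and text_append_string are simplified, hypothetical
stand-ins for the SH_Text functions; only CHUNK_SIZE and its default of
64 are taken from the text above. Overflow of the size arithmetic is
ignored in this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 64

/* Hypothetical flat text buffer. */
struct text
{
	char *data;
	size_t length; /* bytes used, without the terminating NUL */
	size_t size;   /* bytes allocated */
};

/* Make room for at least `size` bytes. Returns 0 on success and -1
 * if the allocation fails (the text is left unchanged). */
int
text_enlarge (struct text *text, size_t size)
{
	size_t new_size;
	char *new_data;

	assert (text != NULL);
	if (size <= text->size)
	{
		return 0; /* already fits: nothing to do */
	}

	/* round up to the next multiple of CHUNK_SIZE */
	new_size = ((size - 1) / CHUNK_SIZE + 1) * CHUNK_SIZE;
	new_data = realloc (text->data, new_size);
	if (new_data == NULL)
	{
		return -1;
	}
	text->data = new_data;
	text->size = new_size;
	return 0;
}

/* Append a string; the buffer is enlarged automatically. */
int
text_append_string (struct text *text, const char *string)
{
	size_t add = strlen (string);

	if (text_enlarge (text, text->length + add + 1) != 0)
	{
		return -1;
	}
	memcpy (text->data + text->length, string, add + 1);
	text->length += add;
	return 0;
}
```

Appending many short strings thus triggers a realloc only when a chunk
boundary is crossed.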
SH_Text_append_string can be used to append a string to the text,
SH_Text_append_text can be used to append another text to the text.
SH_Text_join is a wrapper for SH_Text_append_text, but also frees
the second text, thus joining the texts to a single one.
The constructor SH_Text_new_from_string accepts a string with which the
text is initialized. This can replace the two calls SH_Text_new and
SH_Text_append_string needed so far.
The (internal) implementation of SH_Text was changed from an array of
char to a singly linked list of arrays of char. This allows an easier
implementation of (further) text manipulation.
The API hasn't changed much, but SH_Text_join can't yield an error
anymore, so it now doesn't support passing an error and returns nothing.
The method SH_Text_get_char returns a single character by a given index.
If the index is out of range, NULL is returned and error->type is set to
VALUE_ERROR.
The function SH_Text_get_string returns a substring of text beginning at
index and of length offset. If index is out of bounds, NULL is returned
and an error is set. If offset is out of bounds, the existent part is
returned. Also the length of the returned string can be set (optionally)
to the out parameter length.
To achieve the original behaviour of SH_Text_get_string,
SH_Text_get_string (text, length, error) has to be changed to
SH_Text_get_string (text, 0, SIZE_MAX, length, error). The only
difference will be that the function won't fail when the text is longer
than SIZE_MAX, because it is told to stop there. A text that is longer
than SIZE_MAX can't be returned, but that wasn't possible at any time.
Also I don't think handling char[] longer than SIZE_MAX is possible
with the standard C library. Thus in this case the text can only be
returned in parts (for now only possible up to 2*SIZE_MAX-1 by calling
SH_Text_get_string (text, SIZE_MAX, SIZE_MAX, length, error)) or has to
be manipulated using the appropriate SH_Text methods, which are not
implemented yet.
The function SH_Text_get_range returns a string beginning at start and
ending at end. Note that end specifies the first char that is not
returned any more. Thus the function implements something similar to
the Python slice syntax (text[start:end]). In contrast to the behaviour
there, calling SH_Text_get_range with start > end is undefined
behaviour. If start == end, the empty string is returned.
If start is out of bounds, NULL is returned and an error is set. If end
is out of bounds, the existent part is returned. Also the length of the
returned string can be set (optionally) to the out parameter length.
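These range semantics can be sketched on a plain C string. get_range is
a hypothetical simplification without the error parameter, and treating
start == length as still in bounds is an assumption of this sketch; the
text above does not say how SH_Text_get_range handles that case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of text[start:end], or NULL if
 * start is out of bounds or the allocation fails. If `length` is
 * not NULL, it receives the length of the returned string. */
char *
get_range (const char *text, size_t start, size_t end, size_t *length)
{
	size_t text_length = strlen (text);
	size_t n;
	char *result;

	assert (start <= end); /* start > end is undefined behaviour */

	/* assumption: start == text_length is still in bounds */
	if (start > text_length)
	{
		return NULL;
	}
	if (end > text_length)
	{
		end = text_length; /* return the existent part only */
	}

	n = end - start;
	result = malloc (n + 1);
	if (result == NULL)
	{
		return NULL;
	}
	memcpy (result, text + start, n);
	result[n] = '\0';
	if (length != NULL)
	{
		*length = n;
	}
	return result;
}
```

With this shape, get_range (text, 0, SIZE_MAX, NULL) returns the whole
string, and the caller frees the result.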
The function SH_Text_get_length returns the length of the text. As the
text also supports being longer than SIZE_MAX, this method can fail at
runtime. If the text is longer than SIZE_MAX, the function returns
SIZE_MAX and sets error to DOMAIN_ERROR. Note that due to the
implementation, this is a non-trivial function, so don't use it too
extensively.
The method SH_Text_print just prints the whole string to stdout.
The function SH_Text_set_char allows writing a single character to a
position that already exists in the text, thus overwriting another
character. If the index is out of range, a value error is set and FALSE
is returned.
An attempt was made to implement the text in terms of multiple
text segments.
While it would be preferable, it doesn't seem to be possible to
abstract over the internals of text_segment. That's why only
some basic functionality is moved, but whether more is to
follow is not known yet.
A text_segment allocates memory in chunks. This is now also
done when it is created from a string, but this means that we
can't rely on strdup any more, as it takes care of the
allocation itself. Calling malloc ourselves shouldn't be much
of an overhead, as at least glibc's strdup performs the exact
same steps. Actually we should even spare a strlen call now, so
it should be more performant.
The copy_and_replace function replaces a single character with
a string while copying. This may be replaced by a more elaborate
function, as manipulating a text normally means that
manipulation is deferred until needed, which this function
contradicts.
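A function of this kind could be written as in the following sketch,
here on plain C strings. The signature is an assumption; the real
copy_and_replace works on text segments. It replaces eagerly in two
passes: the first computes the size of the result, the second copies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of `source` in which every
 * occurrence of the character `old` is replaced by the string
 * `replace`. Returns NULL if the allocation fails. */
char *
copy_and_replace (const char *source, char old, const char *replace)
{
	size_t replace_length = strlen (replace);
	size_t length = 0;
	const char *s;
	char *copy;
	char *d;

	assert (old != '\0');

	/* first pass: compute the length of the result */
	for (s = source; *s != '\0'; s++)
	{
		length += (*s == old) ? replace_length : 1;
	}

	copy = malloc (length + 1);
	if (copy == NULL)
	{
		return NULL;
	}

	/* second pass: copy and substitute */
	d = copy;
	for (s = source; *s != '\0'; s++)
	{
		if (*s == old)
		{
			memcpy (d, replace, replace_length);
			d += replace_length;
		}
		else
		{
			*d++ = *s;
		}
	}
	*d = '\0';
	return copy;
}
```

Called as copy_and_replace (text, '\n', "<br />"), it performs the
newline replacement that the html generation of a TextFragment needs.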
Also there is the concept of a text mark.
A mark will be used to point to a specific location inside of a
text. Currently it can't do anything and isn't even used.